FilCDN: SP performance metrics

Problem Description

The Pandora Service provides an optional CDN retrieval service, which allows a user to retrieve their data via filCDN. The user has to pay for this optional service (see the M2 payment setup here). Part of this fee goes to the CDN service and part goes to the SP offering retrieval.

The CDN service allows for fast retrievals independent of geolocation through a simple HTTP request.
A user wants to make sure that the service they are paying for is actually being delivered. To account for this, there needs to be an on-chain mechanism which makes the payout to the SP conditional on certain performance metrics. Only if the SP adheres to these metrics and provides a retrieval service that is within the Pandora Service agreement should the SP be rewarded with their share.

Performance Metrics

The performance metric used to evaluate the CDN retrieval service needs to be as simple and easy to understand as possible while accurately representing the quality of the service that the user paid for.

When a user makes a request to filCDN to retrieve data they previously stored on an SP, filCDN checks its cache, and on a cache miss it forwards the request to the SP. If the SP offers retrieval for this data, filCDN retrieves it from the SP and forwards it to the user, caching the data in the process.

This means that the performance metric deciding the SP’s share of the CDN service fee should be tied to how well the SP is servicing retrievals.

A straightforward representation of this is a Retrieval Success Rate, which is defined by:
rsr = \frac{SuccessfulRequests_{CacheMiss}}{TotalRequests_{CacheMiss}}

Timeframe for calculation

To calculate rsr, the total number of cache-miss requests forwarded to the SP needs to be computed. A timeframe needs to be stipulated for which the rsr is calculated, meaning that only requests within this timeframe flow into the calculation of the rsr.
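As a minimal sketch of this windowed calculation (all names are hypothetical, not part of the filCDN codebase), the rsr for a timeframe can be computed by filtering cache-miss requests to the window:

```python
from dataclasses import dataclass

@dataclass
class CacheMissRequest:
    timestamp: int   # unix seconds
    success: bool    # did the SP serve the retrieval?

def rsr_for_window(requests, start, end):
    """Retrieval Success Rate over cache-miss requests in [start, end)."""
    in_window = [r for r in requests if start <= r.timestamp < end]
    if not in_window:
        return None  # no cache misses in the window -> no score
    successes = sum(1 for r in in_window if r.success)
    return successes / len(in_window)

# 3 cache misses fall inside [0, 300), 2 of them successful -> rsr = 2/3
reqs = [CacheMissRequest(100, True), CacheMissRequest(150, True),
        CacheMissRequest(200, False), CacheMissRequest(900, True)]
print(rsr_for_window(reqs, 0, 300))
```

Returning no score (rather than 0) for an empty window avoids penalising an SP that simply saw no cache misses in the timeframe.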

Aggregate vs Time-Boxed Scores

To decide on the timeframe for calculating the rsr score, its use on-chain needs to be accounted for. A user wants to make sure that what they are paying for is also enforced. Say the Pandora Service expects all SPs to offer a 95% retrieval score when CDN is enabled, i.e. 95% of all cache-miss requests to the SP need to be successful. If an SP meets this requirement, they get 100% of their share, and 0% otherwise. A user is interested in the retrievability of CIDs corresponding to a proofSetId for as long as proofs are created for this proofSetId and verified through the PDP Verifier. This suggests one possible timeframe for calculating the rsr score. However, when a proofSetId is created it is not known how long this timeframe will be, so an aggregated rsr score would need to be calculated on a rolling basis, with the score representing the retrievability for a proofset and provider from when the proofSetId was created until now.

The conditional payment of the SP’s share of the CDN fee could then take the current rolling rsr score into account.
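A rolling aggregate can be maintained incrementally by storing the raw counts rather than the ratio itself, so each new batch of cache-miss results updates the score exactly (a sketch; the class and method names are illustrative):

```python
class RollingRsr:
    """Aggregate rsr for one (provider, proofSetId) pair, from proofset
    creation until now. Keeping the counts (not the ratio) makes the
    rolling update exact."""

    def __init__(self):
        self.successful = 0
        self.total = 0

    def record(self, successes: int, total: int):
        """Fold one reporting interval's cache-miss results into the aggregate."""
        self.successful += successes
        self.total += total

    def score(self):
        return self.successful / self.total if self.total else None

agg = RollingRsr()
agg.record(successes=95, total=100)  # a good first interval
agg.record(successes=0, total=5)     # a bad interval drags the rolling score down
print(agg.score())  # 95/105, about 0.905
```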

Advantages of the aggregated approach:

  • Simple calculation
  • Few datapoints, as it is only a single value per (provider, proofSetId) pair
  • Low complexity, as there is no debate about which rsr score is computed for what timeframe

Disadvantages of the aggregated approach:

  • If the first (or first few) cache-miss requests are unsuccessful, the rsr score is low or 0. If the user then makes no more retrieval attempts, the SP essentially never gets rewarded again, even if they keep retrievals available.

A different approach is to report rsr scores for specific timeframes regularly, with each rsr score calculated over, for example, the last 24 hours.

Advantages of the time-boxed approach:

  • More accurate representation of the quality of the retrieval service. If a user does not retrieve anything, the SP still gets their share for potentially offering it when needed.

Disadvantages of the time-boxed approach:

  • There is a lot more data that needs to make its way on-chain, which can lead to high smart contract storage costs and gas fees.
  • It is not clear which timeframe would fit both the entity that records and calculates the rsr scores (the monitoring service) and the one that performs the rsr score checks and payouts to the SPs. The smallest timeframe is the Filecoin epoch, which is roughly 30 seconds. This would mean that every 30 seconds, every (provider, proofSetId) pair that has seen a cache miss in the last 30 seconds gets a new rsr score.
  • The scaling of the time-boxed solution is questionable, as the amount of on-chain traffic scales directly with the number of cache-miss requests across unique (provider, proofSetId) pairs.
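To put a rough number on the per-epoch worst case (the pair count is a made-up illustration, not a measured figure):

```python
# Rough upper bound on on-chain score updates for per-epoch reporting.
EPOCHS_PER_DAY = 24 * 60 * 60 // 30  # one Filecoin epoch per ~30 seconds
active_pairs = 1_000                 # hypothetical active (provider, proofSetId) pairs

updates_per_day = active_pairs * EPOCHS_PER_DAY
print(EPOCHS_PER_DAY, updates_per_day)  # 2880 epochs -> up to 2,880,000 updates/day
```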

On-Chain Data

The rsr scores need to be stored on-chain so that the conditional payouts to SPs can be completed. This data will be stored in the CDNVerifier contract, which will work similarly to the PDPVerifier. The design of the CDNVerifier is not yet complete, but one key feature is that it will store rsr scores per SP and proofset.

There are a few options for how the CDNVerifier can meet its requirements.

  • It can simply store a single value per (provider, proofSetId) pair.
  • It can store a single value per (provider, proofSetId, calculated_at_epoch) tuple, where calculated_at_epoch indicates when the rsr score was calculated.
  • It can store multiple (provider, proofSetId, epoch) values and, when asked for an rsr score, compute it for the epoch range given in the request, essentially summing the scores for all requested epochs and dividing by the number of scores.

The third approach leads to a lot of data stored on-chain and is therefore the least preferable option. The second approach means higher gas fees as more requests are made and higher computational costs on the monitoring service, but a higher degree of accuracy for the rsr score.

The first approach is the simplest and least expensive, but also yields the least accurate rsr score.

Suggested Approach

The monitoring service calculates and stores the rsr scores per (provider, proofSetId) pair locally and pushes them on-chain regularly. The interval at which it pushes updates on-chain can be flexible; for M2, every 24 hours should be enough. This means that every 24 hours, every (provider, proofSetId) pair that has seen cache-miss requests in the past 24 hours gets a new rsr score, which is pushed to the CDNVerifier contract. This approach keeps the reporting frequency flexible and limits the implementation complexity so that shipping M2 stays realistic.
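The reporting loop on the monitoring-service side could look roughly like the following (a sketch; `push_on_chain` stands in for whatever call submits a score to the CDNVerifier contract, and the data shapes are assumptions):

```python
import time

def report_scores(window_counts, push_on_chain, now=None):
    """Push a new rsr score for every (provider, proofSetId) pair that saw
    cache-miss requests in the reporting window; idle pairs are skipped.

    window_counts: {(provider, proof_set_id): (successes, total)} for the
    last 24 hours. push_on_chain submits one score to the CDNVerifier.
    """
    now = int(time.time()) if now is None else now
    pushed = []
    for (provider, proof_set_id), (succ, total) in window_counts.items():
        if total == 0:
            continue  # no cache misses this window, nothing to report
        rsr = succ / total
        push_on_chain(provider, proof_set_id, rsr, now)
        pushed.append((provider, proof_set_id, rsr))
    return pushed

# Example: one active pair (48/50 successful) and one idle pair
window = {("0xSP1", 7): (48, 50), ("0xSP2", 9): (0, 0)}
submitted = []
report_scores(window, lambda *args: submitted.append(args), now=1000)
print(submitted)  # only the active pair is reported
```

Skipping idle pairs keeps the number of on-chain transactions proportional to actual retrieval activity rather than to the total number of proofsets.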

A user makes a deal through the Pandora Service with an SP for storing data. This deal is identified by the proofset id. The user also specifies whether they want the CDN enabled.

Each time a proofset is created a fee is paid by the user and the payout is triggered by the SP through the PDP Verifier contract.

In a recent proposal, the payment flow was specified to also pay the SP for enabling CDN retrievability. A user now wants to make sure that the fee they are paying to an SP is based on a service level agreement in the Pandora Service that holds the SP to a certain standard of retrievability.

There are multiple metrics to record SP performance (amount of egress served, retrieval success, bandwidth, retrieval speed,…). For now, the retrieval success is the most important metric for a user.

This means a user only wants to pay a fee to the SP if they delivered on their promise to offer retrieval.

To make this payment work based on this condition, there needs to exist on-chain data about the current retrieval score of an SP for a given proofSet id.

This data can be stored in what we call a filCDN verifier contract. The Pandora Service can then ask the filCDN Verifier contract for that data, given an SP and proofSet id, and the payment rail can be adjusted based on whether the metric is sufficient, given the prior service level agreement.

To bring this data on chain we can accumulate the data which we are already recording about retrievals in a new table. The calculation of this table can be triggered in regular intervals (per day?) and then posted on chain to the filCDN verifier contract.

A possible table schema could look like the following:

CREATE TABLE IF NOT EXISTS provider_scores (
  address TEXT NOT NULL,
  proof_set_id INTEGER NOT NULL,
  rsr INTEGER NOT NULL,
  calculated_at DATETIME NOT NULL,
  PRIMARY KEY (address, proof_set_id, calculated_at),
  CONSTRAINT check_positive_rsr CHECK (rsr >= 0)
);
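A quick sanity check of this schema against SQLite (assumptions: proof_set_id is an INTEGER, and since the rsr column is an INTEGER the ratio is stored scaled, e.g. in basis points where 9500 = 95%):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE IF NOT EXISTS provider_scores (
  address TEXT NOT NULL,
  proof_set_id INTEGER NOT NULL,
  rsr INTEGER NOT NULL,
  calculated_at DATETIME NOT NULL,
  PRIMARY KEY (address, proof_set_id, calculated_at),
  CONSTRAINT check_positive_rsr CHECK (rsr >= 0)
)
""")
# One row per (provider, proofSetId, calculation time); 9500 = 95%
conn.execute("INSERT INTO provider_scores VALUES (?, ?, ?, ?)",
             ("0xSP1", 7, 9500, "2025-01-01T00:00:00Z"))
# Fetch the most recent score for the pair
row = conn.execute(
    "SELECT rsr FROM provider_scores "
    "WHERE address = ? AND proof_set_id = ? "
    "ORDER BY calculated_at DESC LIMIT 1", ("0xSP1", 7)).fetchone()
print(row[0])  # 9500
```

The composite primary key means each push per calculation time is naturally idempotent: re-inserting the same (address, proof_set_id, calculated_at) row fails rather than duplicating.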

The calculation of the provider_score is done like so:

ts = lastTimeTheScoreWasCalculated(proofSetId, spAddress)
now = Now()
// Count retrievals in the time window ts -> now
successfulRetrievals = countSuccessfulRetrievals(spAddress, proofSetId, ts, now)
totalRetrievals = countTotalRetrievals(spAddress, proofSetId, ts, now)
// Skip pairs with no cache misses in the window to avoid dividing by zero
if totalRetrievals == 0: skip
rsr = successfulRetrievals / totalRetrievals

Successful retrievals are retrievals where the response status is 200. We only make retrievals for CIDs which we know must exist on the SP. As long as Cloudflare works, this is a reliable statistic.